Abstract

We propose an adaptive version of the total variation algorithm proposed in [3] for computing the
balanced cut of a graph. The algorithm from [3] used a sequence of inner total variation minimizations 
to guarantee descent of the balanced cut energy as well as convergence of the algorithm. In practice 
the total variation minimization step is never solved exactly. Instead, an accuracy parameter is specified 
and the total variation minimization terminates once this level of accuracy is reached. The choice of this 
parameter can vastly impact both the computational time of the overall algorithm as well as the accuracy 
of the result. Moreover, since the total variation minimization step is not solved exactly, the algorithm 
is not guaranteed to be monotonic. In the present work we introduce a new adaptive stopping condition
for the total variation minimization that guarantees monotonicity. This results in an algorithm that is 
actually monotonic in practice and is also significantly faster than previous, non-adaptive algorithms. 

1 Introduction 

Recent works [15][16][10][11][2][4][13][17][12][3] have exploited advances in total variation minimization,
originally developed for applications in image processing, to tackle fundamental problems in machine 
learning. The total variation of an image, described by a function $f(x, y) : [0, 1] \times [0, 1] \to \mathbb{R}$, is given by

    \|f\|_{TV} = \int_{[0,1] \times [0,1]} |\nabla f(x, y)| \, dx \, dy.    (1)

The total variation can also be given a sense in the context of graph theory: given a weighted graph
with vertices $V = \{x_1, \ldots, x_n\}$ and weights $\{w_{ij}\}_{1 \le i,j \le n}$ on its edges, the total variation of a function
$f : V \to \mathbb{R}$ is given by

    \|f\|_{TV} = \sum_{i,j} w_{ij} |f(x_i) - f(x_j)|.    (2)

Minimizing energies involving (1) or (2) is challenging due to the nonlinear and non-differentiable nature
of the problems. In the past five years however, important mathematical breakthroughs together with 
faster computers have given rise to efficient algorithms for total variation minimization [9][1][6]. These
advances have opened many possibilities in imaging sciences, and nowadays the total variation functional 
plays a central role in image processing for denoising and segmentation problems. Recent works
[15][16][10][11][2][4][13][17][12][3] have applied total variation techniques in machine learning and demonstrated
they represent a set of very promising tools that we broadly refer to as "Total Variation Clustering." 

Given a set of data points $V = \{x_1, \ldots, x_n\}$ and similarity weights $\{w_{ij}\}_{1 \le i,j \le n}$ between these data
points, the Balanced Cut problem [7, 8] is:

    Minimize  C(S) := \frac{\mathrm{Cut}(S, S^c)}{\min(|S|, |S^c|)}  over all subsets $S \subsetneq V$.    (3)

Here the numerator $\mathrm{Cut}(S, S^c)$ stands for $\sum_{x_i \in S,\, x_j \in S^c} w_{ij}$, and the term $|S|$ in the denominator denotes
the number of data points in $S$. The Balanced Cut problem (3) attempts to partition the dataset into two
groups of comparable size that are weakly linked. The Balanced Cut problem is NP-hard.
However, several recent works [15][3] have shown that the combinatorial problem (3) is equivalent to the
following continuous relaxation

    Minimize  E(f) := \frac{\|f\|_{TV}}{\|f - \mathrm{med}(f)\mathbf{1}\|_1}  over all non-constant $f \in \mathbb{R}^n$,    (4)

called the TV-Balanced Cut. Here $\|f\|_1 = \sum_{i=1}^{n} |f_i|$ denotes the $\ell^1$ norm of $f$ and $\mathrm{med}(f)$ denotes the
median of $f$, i.e. the $(n/2)$-th smallest entry when $n$ is even. The problem (4) is non-convex, but is provably
equivalent to the original problem. Specifically, a one-to-one correspondence exists between the global 
minimizers of each problem. Moreover, the continuous problem is much easier to optimize. The lack of 
convexity means that the resulting optimization can have difficulties with local minima, however. 
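
To make the two energies concrete, the following minimal Python sketch evaluates (3) and (4) on a toy graph. The helper names (`median`, `total_variation`, `energy`, `balanced_cut`), the dense list-of-lists weight matrix, and the treatment of odd $n$ in the median are assumptions made for illustration only. The sketch also reads (2) as a sum over ordered pairs $(i, j)$; under that reading the indicator function of a set $S$ has energy $E = 2\,C(S)$, a constant factor that does not affect minimizers.

```python
def median(f):
    """(n/2)-th smallest entry for even n, as in the text.
    For odd n this sketch takes the middle entry (an assumption)."""
    s = sorted(f)
    n = len(s)
    return s[n // 2 - 1] if n % 2 == 0 else s[n // 2]

def total_variation(W, f):
    """Graph total variation (2), summed over all ordered pairs (i, j)."""
    n = len(f)
    return sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))

def energy(W, f):
    """TV-Balanced Cut energy (4); f must be non-constant."""
    m = median(f)
    return total_variation(W, f) / sum(abs(x - m) for x in f)

def balanced_cut(W, S):
    """Combinatorial energy (3) for a non-empty proper vertex subset S."""
    comp = [j for j in range(len(W)) if j not in S]
    cut = sum(W[i][j] for i in S for j in comp)
    return cut / min(len(S), len(comp))

# Toy graph (hypothetical): two triangles joined by one weak edge (2, 3).
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i][j] = W[j][i] = w

S = {0, 1, 2}
indicator = [1.0 if i in S else 0.0 for i in range(6)]
print(energy(W, indicator), 2 * balanced_cut(W, S))
```

The two printed values agree, which illustrates (but of course does not prove) the correspondence between the relaxation and the combinatorial problem on binary functions.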

Several algorithms have appeared that attempt to minimize the TV-Balanced Cut. In this work 
we propose a new adaptive total variation algorithm that, to the best of our knowledge, provides the 
fastest and most reliable approach. Our previous algorithm [3] utilized a sequence of "inner" total 
variation minimizations to guarantee descent of the TV-Balanced cut energy as well as convergence of 
the algorithm: 



Algorithm 1 TV algorithm for computing the Balanced Cut

    f^0 non-constant function with med(f^0) = 0 and \|f^0\|_2 = 1.
    while E(f^k) - E(f^{k+1}) > TOL do
        v^k \in \partial_0 \|f^k\|_1
        g^k = f^k + v^k
        h^k = \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \}
        h_0^k = h^k - \mathrm{med}(h^k)\mathbf{1}
        f^{k+1} = h_0^k / \|h_0^k\|_2
    end while



In practice the total variation minimization step (also known as the ROF problem [14]), 

    h^k = \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \}    (5)

is never solved exactly. Instead, a total variation minimization algorithm, such as those proposed in 
[9][1][6], will generate a sequence of iterates $\{h_i^k\}_{i=1}^{\infty}$ that converge toward the exact solution $h^k$ defined
by (5). An accuracy parameter $\epsilon > 0$ is then specified and the total variation minimization algorithm
terminates once

    \|h_{i+1}^k - h_i^k\|_2 < \epsilon.    (6)

The choice of the parameter $\epsilon$ can vastly impact both the computational time of the overall algorithm
as well as the accuracy of the result. It remains unclear how to properly choose the level of accuracy to 
obtain the right balance between these two aims. In addition all theoretical properties of this algorithm, 
along with any other algorithm proposed for the TV-Balanced Cut, are derived under the assumption 
that the total variation solution is exactly obtained. They therefore no longer hold in the actual
implementation of the algorithm. The most important of these properties is monotonicity, i.e. that the
TV-Balanced Cut energy is guaranteed to decrease, $E(f^{k+1}) \le E(f^k)$, at every outer iteration. In this
work, we propose an adaptive stopping condition for the total variation minimization that still guarantees 
monotonicity of the algorithm. This results in an algorithm that is actually monotonic in practice and is 
more than two times faster on benchmark databases, such as the MNIST database, without sacrificing 
accuracy of the result. The key idea lies in solving the total variation step only to the amount needed to 
obtain "sufficient energy descent," where "sufficient" has a precise mathematical meaning that guarantees 
the important theoretical properties of the idealized algorithm still hold. 
2 The Proposed Algorithm



We propose to replace the stopping condition (6), which is used by all TV-Balanced Cut algorithms to
date [15][16][10][11][3], by an adaptive stopping condition that guarantees monotonicity and results
in a significantly more efficient algorithm overall. The genesis of this idea lies in the following energy
inequality

    E(f^k) \ge E(f^{k+1}) + \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1}    (7)

that holds for the idealized algorithm above. See [3] for a proof of this result. This inequality guarantees
that the energy $E(f)$ decreases by at least

    \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1}

at every iteration. Moreover, this energy inequality forms the basis of the proof for the theoretical
properties of the idealized algorithm.

Our adaptive stopping condition simply uses a relaxed version of the inequality (7). Fix $\theta \in (0, 1)$
and let $\{h_i^k\}_{i=1}^{\infty}$ denote the sequence of iterates generated by a total variation minimization algorithm
solving the inner problem (5). Since $\lim_{i \to +\infty} h_i^k = h^k$, we have

    \lim_{i \to +\infty} \left[ E(h_i^k) + \theta \frac{E(f^k) \|h_i^k - f^k\|_2^2}{\|h_i^k - \mathrm{med}(h_i^k)\mathbf{1}\|_1} \right]
        = E(h^k) + \theta \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1} < E(f^k).    (8)

The above equality comes from the continuity of each of the following: the energy $E$; the median; the
$\ell^1$ norm; and the $\ell^2$ norm. That (8) holds with strict inequality follows as a consequence of (7) together
with the fact that $\theta < 1$ (recall that $E(f^{k+1}) = E(h^k)$, since $E$ is invariant under the addition of
constants and under rescaling). From (8) it is clear that for $i$ large enough the following holds:

    E(h_i^k) + \theta \frac{E(f^k) \|h_i^k - f^k\|_2^2}{\|h_i^k - \mathrm{med}(h_i^k)\mathbf{1}\|_1} \le E(f^k).    (9)

In this work, we propose to use inequality (9) as the stopping criterion when solving the inner problem
(5). This leads to the proposed algorithm:

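Algorithm 2 Adaptive TV algorithm for computing the Balanced Cut

    f^0 non-constant function with med(f^0) = 0 and \|f^0\|_2 = 1, \theta = 0.99.
    while E(f^k) - E(f^{k+1}) > TOL do
        v^k \in \partial_0 \|f^k\|_1
        g^k = f^k + v^k
        Solve h^k \approx \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \} until
            E(h^k) + \theta \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1} \le E(f^k)
        h_0^k = h^k - \mathrm{med}(h^k)\mathbf{1}
        f^{k+1} = h_0^k / \|h_0^k\|_2
    end while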



The notation $v^k \in \partial_0 \|f^k\|_1$ means that $v^k$ denotes any element of the sub-differential $\partial \|f^k\|_1$ of the
$\ell^1$-norm at $f^k$ that has zero mean. Note that $\partial_0 \|f^k\|_1$ is never empty due to the fact that $f^k$ has zero
median. Indeed, we can take the particular choice of $v^k \in \mathbb{R}^n$ due to [10]:

    v^k(x_i) = \begin{cases}
        1 & \text{if } f^k(x_i) > 0, \\
        -1 & \text{if } f^k(x_i) < 0, \\
        (n^- - n^+)/n^0 & \text{if } f^k(x_i) = 0,
    \end{cases}    (10)

where $n^+$, $n^-$ and $n^0$ denote the number of elements in the sets $\{x_i : f^k(x_i) > 0\}$, $\{x_i : f^k(x_i) < 0\}$ and
$\{x_i : f^k(x_i) = 0\}$, respectively. Other possible choices also exist, so that $v^k$ is not uniquely defined. This
idea, i.e. choosing an element from the sub-differential with mean zero, was introduced in [10] and proves
indispensable when dealing with median zero functions.
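
The particular choice of $v^k$ described above can be sketched in a few lines of Python. The function name `zero_mean_subgradient` and the example vector are hypothetical; the sketch assumes its input already has median zero in the sense used in the text, so that at least one entry is exactly zero.

```python
def zero_mean_subgradient(f):
    """Return the zero-mean element of the l1 subdifferential at f given by
    +1 on positive entries, -1 on negative entries and (n^- - n^+)/n^0 on zeros."""
    n_plus = sum(1 for x in f if x > 0)
    n_minus = sum(1 for x in f if x < 0)
    n_zero = len(f) - n_plus - n_minus
    if n_zero == 0:
        raise ValueError("f needs at least one zero entry (median zero)")
    fill = (n_minus - n_plus) / n_zero
    return [1.0 if x > 0 else -1.0 if x < 0 else fill for x in f]

# Example with n = 6: the 3rd smallest entry is 0, so the median is zero.
f = [-0.5, 0.0, 0.0, 0.0, 0.3, 0.7]
v = zero_mean_subgradient(f)
print(v)
```

Here $n^+ = 2$, $n^- = 1$ and $n^0 = 3$, so the zero entries receive the value $-1/3$ and the entries of $v$ sum to zero. Because every entry lies in $[-1, 1]$ and matches the sign of $f$ wherever $f$ is non-zero, $v$ is indeed a subgradient of the $\ell^1$ norm at $f$.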

We choose the parameter $\theta$ close to one, e.g. $\theta = 0.99$, in our implementation of the proposed
algorithm. We keep $\theta$ strictly smaller than one so that we can guarantee the stopping condition (9) is,
in fact, reached in a finite number of iterations. Our experiments have indicated that a larger choice for
$\theta$ leads to a more efficient algorithm. In the actual implementation of the algorithm we do not observe
any difference between choosing $\theta = 0.99$, $\theta = 0.999$ or $\theta = 0.9999$.

The new stopping criterion (9) has three significant advantages over the more traditional stopping
criterion (6) used in [15][16][10][11][3].

1. Monotonicity: With the new stopping criterion the energy $E(f)$ is guaranteed to decrease at
every step of the outer loop. In other words, the algorithm as implemented is now truly monotonic.
Indeed, the stopping condition (9) was specially designed to achieve this. The fixed, non-adaptive
condition (6) simply does not guarantee monotonicity in the implemented algorithm.

2. Robustness with Respect to Choice of Parameters: We observe in our experiments that the adaptive
algorithm is not sensitive to the choice of the parameter $\theta$ as long as $\theta \approx 1$ and $\theta < 1$. Specifically,
$\theta = 0.99$ (or $\theta = 0.999$) is nearly optimal for any dataset. This markedly contrasts with the old
stopping criterion (6); the non-adaptive algorithm is very sensitive to the choice of the parameter
$\epsilon$ in terms of both accuracy and efficiency. Moreover, the proper choice of $\epsilon$ may vary significantly
between two different datasets.

3. Speed: The proposed algorithm is adaptive in the sense that it does not waste computational effort 
in solving the inner loop to a greater precision than needed. In contrast, the non-adaptive algorithm 
solves the inner problem to the same degree of precision at every outer step of the algorithm. Overall 
this results in a significant gain in efficiency. 



3 Notation and Properties of the Algorithm 

In this section we first provide the complete, formalized implementation details for the algorithm
described above. We then proceed to develop its mathematical properties.

3.1 Notation 

First, we recall the definition of the subdifferentials of the TV semi-norm $\|f\|_{TV}$ and the $\ell^1$ norm $\|f\|_1$
at $f$:

    \partial \|f\|_{TV} := \{ v \in \mathbb{R}^n : \|g\|_{TV} - \|f\|_{TV} \ge \langle v, g - f \rangle \ \forall g \in \mathbb{R}^n \},    (11)
    \partial \|f\|_1 := \{ v \in \mathbb{R}^n : \|g\|_1 - \|f\|_1 \ge \langle v, g - f \rangle \ \forall g \in \mathbb{R}^n \}.    (12)

We denote by $\partial_0 \|f\|_1$ those elements of the subdifferential $\partial \|f\|_1$ that have zero mean. As the successive
iterates $f^k$ have zero median, $\partial_0 \|f^k\|_1$ is never empty. For example, we can take $v^k \in \mathbb{R}^n$ so that
$v^k(x_i) = 1$ if $f^k(x_i) > 0$, $v^k(x_i) = -1$ if $f^k(x_i) < 0$ and $v^k(x_i) = (n^- - n^+)/n^0$ if $f^k(x_i) = 0$, where $n^+$, $n^-$
and $n^0$ denote the number of vertices in the sets $\{x_i : f^k(x_i) > 0\}$, $\{x_i : f^k(x_i) < 0\}$ and $\{x_i : f^k(x_i) = 0\}$,
respectively.

We next precisely define the approximate total variation step

    h^k := \mathrm{approx}\,\arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \}

that we previously described. From our previous work [3], we know that if $H^k$ denotes the (unique) exact
solution to the total variation minimization problem,

    H^k := \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \},    (13)

then $H^k$ satisfies the energy inequality (7).

In particular, we have that $E(f^k) > E(H^k)$ unless $H^k = f^k$, i.e. $f^k$ itself is the solution to the
total variation minimization. In the latter case, it follows from the definition of $H^k$ that there exists
$w^k \in \partial \|f^k\|_{TV}$ so that

    0 = w^k + E(f^k)(f^k - g^k) = w^k + E(f^k)(f^k - f^k - v^k) = w^k - E(f^k) v^k,

which implies the current iterate $f^k$ is a critical point of the energy.
Turning now to the approximate case, let 

    \{ \Phi^m(E(f^k); g^k) \}_{m=1}^{\infty},    g^k = f^k + v^k,    \Phi^1(E(f^k); g^k) = f^k

denote a sequence of iterates that converge to the exact solution starting from the initial point $f^k$, i.e.

    \Phi^m(E(f^k); g^k) \to H^k  as  m \to \infty.

In what follows, we use the shorthand $\Phi_k^m$ to denote $\Phi^m(E(f^k); g^k)$. If $H^k \ne f^k$, the continuity of the
energy $E$, the median, the $\ell^1$ norm and the $\ell^2$ norm combine to show that for any $\theta \in (0, 1)$ there exists
a finite $M_k$ with the following property:

    E(f^k) < E(\Phi_k^m) + \theta \frac{E(f^k) \|\Phi_k^m - f^k\|_2^2}{\|\Phi_k^m - \mathrm{med}(\Phi_k^m)\mathbf{1}\|_1}    if 2 \le m \le M_k - 1,

    E(f^k) \ge E(\Phi_k^{M_k}) + \theta \frac{E(f^k) \|\Phi_k^{M_k} - f^k\|_2^2}{\|\Phi_k^{M_k} - \mathrm{med}(\Phi_k^{M_k})\mathbf{1}\|_1}.

We can only guarantee such an $M_k$ exists provided $\theta < 1$, so we take a value of $\theta$ close to one,
e.g. $\theta = 0.99$, which we have found works best in practice. We then define

    h^k := \mathrm{approx}\,\arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \}
        := \begin{cases} \Phi_k^{M_k} & \text{if } H^k \ne f^k, \\ f^k & \text{if } H^k = f^k. \end{cases}

In the second case, i.e. when $f^{k+1} = H^k = f^k$, we terminate the outer loop as well since the algorithm
has reached a critical point of the energy. In practice, we set a maximum number of iterations $m \le M_{\max}$
that, if reached, signifies the "exact" solution of (13) has been found.
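
The following self-contained Python sketch puts the pieces of this subsection together on a toy graph. It is a sketch under stated assumptions, not the implementation used in the experiments: the inner solver is a plain projected-gradient ascent on the dual of the ROF problem (chosen only for brevity), it starts from a zero dual variable rather than from $f^k$, and when the iteration cap is reached without meeting the stopping condition it simply stops the outer loop. All function names, the dense weight matrix, the step size and the test graph are hypothetical. Because an inner iterate is accepted only when it satisfies the adaptive condition with $\theta < 1$, the recorded energies cannot increase.

```python
import math

def med(f):
    """Median as in the text: the (n/2)-th smallest entry for even n."""
    s = sorted(f)
    return s[len(s) // 2 - 1] if len(s) % 2 == 0 else s[len(s) // 2]

def tv(W, f):
    """Graph total variation (2), summed over ordered pairs (i, j)."""
    n = len(f)
    return sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))

def l1_centered(f):
    m = med(f)
    return sum(abs(x - m) for x in f)

def energy(W, f):
    return tv(W, f) / l1_centered(f)

def subgrad(f):
    """Zero-mean element of the l1 subdifferential; needs a zero entry in f."""
    pos = sum(1 for x in f if x > 0)
    neg = sum(1 for x in f if x < 0)
    fill = (neg - pos) / (len(f) - pos - neg)
    return [1.0 if x > 0 else -1.0 if x < 0 else fill for x in f]

def normalize(h):
    """Subtract the median, then rescale to unit l2 norm."""
    m = med(h)
    h0 = [x - m for x in h]
    nrm = math.sqrt(sum(x * x for x in h0))
    return [x / nrm for x in h0]

def adaptive_step(W, f, theta=0.99, max_inner=1500):
    """One outer iteration. Returns the next iterate, or None when no inner
    iterate met the adaptive stopping condition within max_inner steps."""
    n = len(f)
    Ef = energy(W, f)
    g = [a + b for a, b in zip(f, subgrad(f))]
    # Each unordered edge carries weight 2*w_ij because (2) counts both orders.
    edges = [(i, j, 2.0 * W[i][j]) for i in range(n) for j in range(i + 1, n) if W[i][j] > 0]
    max_deg = max(sum(1 for w in row if w > 0) for row in W)
    tau = Ef / (2.0 * max_deg)      # conservative step size for the dual ascent
    p = [0.0] * len(edges)          # dual variable, one entry per edge
    for _ in range(max_inner):
        u = list(g)                 # primal iterate u = g - D^T p / E(f)
        for (i, j, _c), pe in zip(edges, p):
            u[i] -= pe / Ef
            u[j] += pe / Ef
        p = [max(-c, min(c, pe + tau * (u[i] - u[j]))) for (i, j, c), pe in zip(edges, p)]
        denom = l1_centered(u)
        dist2 = sum((a - b) ** 2 for a, b in zip(u, f))
        if denom > 0 and dist2 > 0 and energy(W, u) + theta * Ef * dist2 / denom <= Ef:
            return normalize(u)
    return None

def adaptive_balanced_cut(W, f0, tol=1e-10, max_outer=30):
    f = normalize(f0)
    energies = [energy(W, f)]
    for _ in range(max_outer):
        nxt = adaptive_step(W, f)
        if nxt is None:             # treated here as having reached a critical point
            break
        f = nxt
        energies.append(energy(W, f))
        if energies[-2] - energies[-1] <= tol:
            break
    return f, energies

# Toy graph (hypothetical): two triangles joined by one weak edge (2, 3).
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i][j] = W[j][i] = w
f_final, energies = adaptive_balanced_cut(W, [0.9, 0.1, 0.4, -0.3, 0.2, -0.8])
print(energies)
```

The design choice worth noting is that the check inside the inner loop costs one energy evaluation per inner iteration, which requires a median; Section 4 describes a variant of the condition that avoids this.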

3.2 Properties of the Approximate Algorithm 

We now proceed to demonstrate that, due to the control afforded us by the energy inequality, the
approximate total variation algorithm still enjoys many of the same mathematical properties as our
previous idealized algorithm. We first demonstrate that the intermediate steps $(h^k, h_0^k)$ in the iteration
remain in a compact set. If $h^k = f^k$ then obviously $\|h^k\|_2 = \|f^k\|_2 = 1$ by definition of the iterates.
Otherwise $h^k$ satisfies the energy inequality

    E(f^k) \ge E(h^k) + \theta \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1}.

Note that each of the iterates $f^k$ belongs to the closed subset

    S_0^{n-1} := \{ f \in \mathbb{R}^n : \|f\|_2 = 1 \text{ and } \mathrm{med}(f) = 0 \}    (15)

of the $\ell^2$ sphere. As $S_0^{n-1}$ does not contain any constant functions and we assume a connected graph,
$E(f) > 0$ for all $f \in S_0^{n-1}$. Moreover, since $S_0^{n-1}$ is a closed and bounded set on which $E$ is continuous, $E$ attains
a strictly positive minimum $E(f) \ge E(f^*) = \alpha$ on $S_0^{n-1}$, so that $E(f^0) \ge E(f^k) \ge \alpha$ uniformly for all
iterates. A combination of this fact with the triangle inequality and the facts that $\|x\|_1 \le \sqrt{n} \|x\|_2$ and
$|\mathrm{med}(x)| \le \|x\|_2$ for all $x \in \mathbb{R}^n$ then demonstrates

    \|h^k - f^k\|_2^2 \le \frac{E(f^0)}{\theta\alpha} \|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 \le \frac{\sqrt{n}\, E(f^0)}{\theta\alpha} (1 + \sqrt{n}) \|h^k\|_2.

By expanding the inner-product on the left hand side this reveals

    \|h^k\|_2^2 - 2\langle h^k, f^k \rangle + 1 \le \frac{E(f^0)}{\theta\alpha} \|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 \le \frac{\sqrt{n}\, E(f^0)}{\theta\alpha} (1 + \sqrt{n}) \|h^k\|_2,

which by Cauchy-Schwarz implies

    \|h^k\|_2^2 \le \|h^k\|_2^2 + 1 \le \left( 2 + \frac{\sqrt{n}\, E(f^0)}{\theta\alpha} (1 + \sqrt{n}) \right) \|h^k\|_2.

Dividing by $\|h^k\|_2$ then yields the desired estimate that holds for all $k \ge 0$:

    \|h^k\|_2 \le 2 + \frac{\sqrt{n}\, E(f^0)}{\theta\alpha} (1 + \sqrt{n}).

In other words, the iterates $h^k$ lie in a fixed, compact set. Arguing as in [3], this allows us to obtain

Lemma 1 (Compactness of $\mathcal{A}_{SD}$). Let $f^0 \in S_0^{n-1}$ and define a sequence of iterates $(g^k, h^k, h_0^k, f^{k+1})$
according to the approximate algorithm. Then there exists an $R > 0$ independent of $k$ so that

    \|h^k\|_2 \le R  and  0 < \|h_0^k\|_2 \le (1 + \sqrt{n}) \|h^k\|_2.    (16)

Moreover, we have

    \|h^k - f^k\|_2 \to 0,    \mathrm{med}(h^k) \to 0,    \|f^k - f^{k+1}\|_2 \to 0.    (17)

Proof. The first statement follows from the preceding uniform compactness argument. That $0 < \|h_0^k\|_2$
follows since $h^k$ is not constant. Indeed, if $h^k = f^k$ then $h^k \in S_0^{n-1}$ and is therefore not constant.
Otherwise, that $h^k$ satisfies the energy inequality implies $\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 > 0$ and again $h^k$ is not
constant. The upper bound $\|h_0^k\|_2 \le (1 + \sqrt{n}) \|h^k\|_2$ follows from the triangle inequality. For the second
statement, as $f^k \in S_0^{n-1}$ it follows that $E(f^k) \ge \alpha > 0$. From the energy inequality,

    \|h^k - f^k\|_2^2 \le \frac{1}{\theta\alpha} \|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 \left( E(f^k) - E(f^{k+1}) \right) \le C \left( E(f^k) - E(f^{k+1}) \right) \to 0,    (18)

for some universal constant $C$, due to uniform compactness of the iterates. Convergence to zero follows
as $E(f^k)$ is decreasing and bounded from below, and therefore converges. By continuity of the median
and the fact that $\mathrm{med}(f^k) = 0$, any limit point of the $\{f^k\}$ must have median zero. As $\|h^k - f^k\|_2 \to 0$,
any limit point of the $\{h^k\}$ must also have median zero, which implies that $\mathrm{med}(h^k) \to 0$ as well. The
triangle inequality then implies $\|h_0^k - f^k\|_2 \to 0$, so that $\|h_0^k\|_2 \to 1$ and $\|f^{k+1} - f^k\|_2 \to 0$ as desired. □

As a consequence of this lemma, we obtain the following corollary that shows the approximate algo- 
rithm and the idealized algorithm from [3] share the same global convergence properties: 

Corollary 1. Take $f^0 \in S_0^{n-1}$ and let $\{f^k\}$ denote any sequence defined through the approximate total

variation algorithm. Then either the sequence {f k } converges or the set of accumulation points form a 
'0 



The Critical Point Property 

Next, we turn our attention to characterizing the limit points of the sequence $\{f^k\}$. We wish to establish
the critical point property, i.e. that any limit point of $\{f^k\}$ is a critical point of the energy. Specifically,
if $f^\infty$ denotes a limit point of $\{f^k\}$ then there exist $v^\infty \in \partial_0 \|f^\infty\|_1$ and $w^\infty \in \partial \|f^\infty\|_{TV}$ so that

    0 = w^\infty - E(f^\infty) v^\infty.

To this end, let us suppose that we have a subsequence satisfying

    f^{k_j} \to f^\infty,    f^{k_j + 1} \to f^\infty,

where the second statement follows from the statement $\|f^k - f^{k+1}\|_2 \to 0$ in the previous lemma. Note
that the previous lemma implies $h^{k_j}, h_0^{k_j} \to f^\infty$ as well. As $\{v^{k_j}\}$ lie in a uniform compact set, as
each entry of $v^{k_j}$ lies in $[-1, 1]$, we can (by passing to a further subsequence if necessary) assume that
$v^{k_j} \to v^\infty$ for some $v^\infty \in \mathbb{R}^n$. By definition, for all $g \in \mathbb{R}^n$ we have that

    \|g\|_1 - \|f^{k_j}\|_1 \ge \langle v^{k_j}, g - f^{k_j} \rangle

and $\langle v^{k_j}, \mathbf{1} \rangle = 0$, which by passing to the limit $k_j \to \infty$ in both statements reveals that $v^\infty \in \partial_0 \|f^\infty\|_1$
as well.

Before we can establish the critical point property, we clearly must place at least some assumptions 
on the total variation solver $\Phi^m(E(f), g)$. Specifically, we make three assumptions.

Assumption 1. (Convergence) For every $(E(f), g)$ the solver $\Phi^m(E(f), g)$ is convergent, i.e.

    \Phi^m(E(f), g) \to \arg\min_u \{ \|u\|_{TV} + \frac{E(f)}{2} \|u - g\|_2^2 \}.

Assumption 2. (Continuity of the Iterates) For every $m \ge 1$, the function $(E(f), g) \mapsto \Phi^m(E(f), g)$
is continuous.

Assumption 3. (The Semigroup Property) For any $m, n \ge 1$, if $\Phi^n(E(f), g) = \Phi^m(E(f), g)$ then
$\Phi^{n+1}(E(f), g) = \Phi^{m+1}(E(f), g)$ as well.

We obviously require the first assumption, while the second assumption is reasonable and does in fact
hold for the popular total-variation solvers. The third assumption essentially states that during the iterative
scheme, the next iterate $\Phi^{m+1}$ is determined entirely by the current iterate $\Phi^m$, but not by multiple previous
iterates or other auxiliary variables. This assumption fails for many of the popular total variation solvers
such as the alternating direction method of multipliers or primal-dual algorithms. It does hold for
so-called "first-order" solvers, however, such as straightforward gradient-descent, forward-backward splitting
schemes or Uzawa iteration applied to the dual problem. We include it for simplicity in illustrating that, as
a proof-of-concept, the control afforded us by the energy inequality allows us to retain in the approximate 
algorithm all convergence properties of the idealized algorithm. We leave the proof in the more general 
case to future work. 

Returning now to establishing the critical point property, assume that f°° is not a critical point of 
the energy. By definition, then, 

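    0 \notin \partial \|f^\infty\|_{TV} - E(f^\infty) v^\infty  \iff  0 \notin \partial \|f^\infty\|_{TV} + E(f^\infty)(f^\infty - g^\infty),    g^\infty = f^\infty + v^\infty.

In particular,

    f^\infty \ne \arg\min_u \{ \|u\|_{TV} + \frac{E(f^\infty)}{2} \|u - g^\infty\|_2^2 \}.

As before define

    H^\infty := \arg\min_{u \in \mathbb{R}^n} \{ \|u\|_{TV} + \frac{E(f^\infty)}{2} \|u - g^\infty\|_2^2 \}

along with the corresponding sequence of iterates

    \Phi^m(E(f^\infty), g^\infty) \to H^\infty  as  m \to \infty,    \Phi^1(E(f^\infty), g^\infty) = f^\infty.

As $f^\infty \ne H^\infty$ there exists a finite $M$ with the property that (where $\Phi_\infty^m$ is shorthand for $\Phi^m(E(f^\infty), g^\infty)$)

    E(f^\infty) < E(\Phi_\infty^m) + \theta \frac{E(f^\infty) \|\Phi_\infty^m - f^\infty\|_2^2}{\|\Phi_\infty^m - \mathrm{med}(\Phi_\infty^m)\mathbf{1}\|_1}    if 2 \le m \le M - 1,

    E(f^\infty) \ge E(\Phi_\infty^M) + \theta \frac{E(f^\infty) \|\Phi_\infty^M - f^\infty\|_2^2}{\|\Phi_\infty^M - \mathrm{med}(\Phi_\infty^M)\mathbf{1}\|_1}.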
We may suppose that each of the iterates $h^{k_j}$ came from an approximate total variation solve, i.e.
$h^{k_j} = \Phi^{M_{k_j}}(E(f^{k_j}), g^{k_j})$ for some finite iteration number $M_j := M_{k_j}$, since if this is not the case then the
sequence $\{f^k\}$ reaches a critical point of the energy in a finite number of iterations.

As $f^{k_j} \to f^\infty$, $E(f^{k_j}) \to E(f^\infty)$ and $g^{k_j} \to g^\infty$ and the approximate total variation procedure
performed at $(E(f^\infty), g^\infty)$ terminates in $M$ iterations, we would expect that for $j$ large enough the
approximate total variation procedure at $(E(f^{k_j}), g^{k_j})$ would also terminate in $M$ iterations, but here we
must be a bit more careful. By the continuity of the iterates $\Phi^m$ we do have that the energy inequality

    E(f^{k_j}) \ge E(\Phi_{k_j}^M) + \theta \frac{E(f^{k_j}) \|\Phi_{k_j}^M - f^{k_j}\|_2^2}{\|\Phi_{k_j}^M - \mathrm{med}(\Phi_{k_j}^M)\mathbf{1}\|_1}

holds for all $k_j$ sufficiently large. In other words, there exists $J$ so that if $j \ge J$ then $M_j \le M$. As
$M_j \le M$ for all $k_j$ sufficiently large, this means that the entire sequence $\{M_j\}_{j=1}^{\infty}$ is, in fact, bounded. We
may therefore extract yet another subsequence $k_{j_l}$ so that $M_{j_l} \to M^*$ for some $2 \le M^* \in \mathbb{R}$. However,
as the $M_{j_l}$ form a Cauchy sequence and are also integers (so $|M_{j_l} - M_{j_{l'}}| \ge 1$ unless they are equal) this
implies that in fact $M_{j_l} = M^* \in \mathbb{N}$ for all $l$ sufficiently large. So along this subsequence we also have

    h^{k_{j_l}} = \Phi^{M^*}(E(f^{k_{j_l}}), g^{k_{j_l}}),    h^{k_{j_l}}, h_0^{k_{j_l}}, f^{k_{j_l}+1} \to f^\infty,    v^{k_{j_l}} \to v^\infty.

That is, for $l$ large enough the terminating index does not change. As $h^{k_{j_l}} \to f^\infty$, it follows from
continuity of the iterates that

    \Phi^{M^*}(E(f^\infty), g^\infty) = f^\infty = \Phi^1(E(f^\infty), g^\infty).

By the semigroup property, for any $n \in \mathbb{N}$ it follows that $\Phi^{1 + n(M^* - 1)}(E(f^\infty), g^\infty) = f^\infty$ as well.
In particular, $f^\infty$ appears infinitely often. As $\Phi^m$ converges as $m \to \infty$, we then necessarily have
$\Phi^m(E(f^\infty), g^\infty) = f^\infty$ for all $m$ and $\lim_{m \to \infty} \Phi^m(E(f^\infty), g^\infty) = f^\infty$, that is:

    f^\infty = \arg\min_u \{ \|u\|_{TV} + \frac{E(f^\infty)}{2} \|u - g^\infty\|_2^2 \}.    (19)



This contradicts the assumption that f°° is not a critical point of the energy, which completes the proof. 

Remark 1. While the semigroup assumption suffices to establish the critical point property, it often 
proves too restrictive. If instead we establish the existence of a strictly monotone quantity F, i.e. 
$F(\Phi^m) > F(\Phi^{m+1})$ unless $\Phi^m = \Phi^{m+1}$, such as the total variation energy or the residual, then the
same proof works even in the absence of the semigroup property. 

4 A stopping condition for the inner TV problem which 
does not involve computing the median 

In this section we present an alternative approximate total variation algorithm that avoids having to 
compute the energy $E(\Phi^m)$ at each iteration of the total variation solver. The motivation for this lies in
the fact that other total variation clustering problems, such as TV-Normalized Cut, rely on energies with 
weighted medians that are expensive to compute. An algorithm that avoids this extra computation, yet 
still satisfies the energy inequality, would therefore produce an additional gain in efficiency. We develop 
this idea for the TV-Balanced Cut problem; the idea extends in a straightforward fashion to other total 
variation clustering problems. 

If we solve the inner total variation problem exactly, i.e. we compute

    H^k := \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \},

then we have that

    E(f^k)(g^k - H^k) \in \partial \|H^k\|_{TV}.

In particular, this implies that

    \|f^k\|_{TV} \ge \|H^k\|_{TV} + E(f^k) \langle g^k - H^k, f^k - H^k \rangle
                 = \|H^k\|_{TV} + E(f^k) \|H^k - f^k\|_2^2 - E(f^k) \langle v^k, H^k - f^k \rangle.
Now let $\theta = 0.99$ and $\Phi_k^m := \Phi^m(E(f^k), g^k) \to H^k$ when $m \to \infty$ as before. If $H^k = f^k$ then $f^k$ is a
critical point of the energy and we terminate the algorithm. Otherwise $H^k \ne f^k$ so there exists a finite
$M_k$ with the property that

    \|f^k\|_{TV} < \|\Phi_k^m\|_{TV} + \theta E(f^k) \|\Phi_k^m - f^k\|_2^2 - E(f^k) \langle v^k, \Phi_k^m - f^k \rangle,    2 \le m \le M_k - 1,    (20)
    \|f^k\|_{TV} \ge \|\Phi_k^{M_k}\|_{TV} + \theta E(f^k) \|\Phi_k^{M_k} - f^k\|_2^2 - E(f^k) \langle v^k, \Phi_k^{M_k} - f^k \rangle,    (21)

and we set $h^k = \Phi_k^{M_k}$ just as in the previous algorithm. Note that checking (21) only requires computing
$\|\Phi_k^m\|_{TV}$ and two inner-products at each iteration. It then follows, due to the fact that $v^k \in \partial_0 \|f^k\|_1$,
that

    \|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 - \|f^k\|_1 \ge \langle v^k, h^k - f^k \rangle.

Multiplying this inequality by $E(f^k)$ and adding it to the previous inequality yields

    \|f^k\|_{TV} + E(f^k) \|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1 \ge E(f^k) \|f^k\|_1 + \|h^k\|_{TV} + \theta E(f^k) \|h^k - f^k\|_2^2,

or in other words (using $\|f^k\|_{TV} = E(f^k) \|f^k\|_1$, which holds because $\mathrm{med}(f^k) = 0$)

    E(f^k) \ge E(h^k) + \theta \frac{E(f^k) \|h^k - f^k\|_2^2}{\|h^k - \mathrm{med}(h^k)\mathbf{1}\|_1}.

That is, $h^k$ satisfies the desired energy inequality. As a consequence, all of the compactness and
convergence results from the previous section hold, with only slight modification, for this algorithm as well.
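
The implication just derived, that the median-free condition (21) forces the energy inequality, can be checked numerically. The sketch below is illustrative only: the function names, the path graph and the random perturbations are hypothetical, and the energy inequality is tested in its multiplied-through form $E(f)\|h - \mathrm{med}(h)\mathbf{1}\|_1 \ge \|h\|_{TV} + \theta E(f)\|h - f\|_2^2$. Writing each condition as "slack $\ge 0$", the argument in the text amounts to saying that the slack of the energy inequality is never smaller than the slack of (21).

```python
import random

def med(f):
    """Median as in the text: the (n/2)-th smallest entry for even n."""
    s = sorted(f)
    return s[len(s) // 2 - 1] if len(s) % 2 == 0 else s[len(s) // 2]

def tv(W, f):
    """Graph total variation (2), summed over ordered pairs (i, j)."""
    n = len(f)
    return sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))

def slack_median_free(W, f, v, h, theta):
    """Left side minus right side of (21); the condition holds when this is >= 0.
    Assumes med(f) = 0, so that E(f) = TV(f) / ||f||_1."""
    Ef = tv(W, f) / sum(abs(x) for x in f)
    d = [a - b for a, b in zip(h, f)]
    dist2 = sum(x * x for x in d)
    return tv(W, f) - tv(W, h) - theta * Ef * dist2 + Ef * sum(a * b for a, b in zip(v, d))

def slack_energy(W, f, h, theta):
    """Energy inequality multiplied through by ||h - med(h)1||_1 (needs a median)."""
    Ef = tv(W, f) / sum(abs(x) for x in f)
    m = med(h)
    dist2 = sum((a - b) ** 2 for a, b in zip(h, f))
    return Ef * sum(abs(x - m) for x in h) - tv(W, h) - theta * Ef * dist2

# Path graph on 6 vertices; f has median zero, v is the zero-mean subgradient (10).
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(6)] for i in range(6)]
f = [-0.5, 0.0, 0.0, 0.0, 0.3, 0.7]
v = [-1.0, -1 / 3, -1 / 3, -1 / 3, 1.0, 1.0]

random.seed(0)
gaps = []
for _ in range(200):
    h = [x + random.uniform(-0.3, 0.3) for x in f]
    gaps.append(slack_energy(W, f, h, 0.99) - slack_median_free(W, f, v, h, 0.99))
print(min(gaps))
```

The smallest gap is non-negative up to rounding, so any candidate $h$ accepted by the cheaper test would also pass the test that requires the median. Note that `slack_median_free` needs one total variation evaluation of $h$ and two inner products, whereas `slack_energy` additionally sorts $h$ to find its median.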

Using (21) as a stopping condition in the total variation minimization solver leads to the following
variation of Algorithm 2:

Algorithm 3 Variation of Algorithm 2 without median in inner stopping condition

    f^0 non-constant function with med(f^0) = 0 and \|f^0\|_2 = 1, \theta = 0.99.
    while E(f^k) - E(f^{k+1}) > TOL do
        v^k \in \partial_0 \|f^k\|_1
        g^k = f^k + v^k
        Solve h^k \approx \arg\min_u \{ \|u\|_{TV} + \frac{E(f^k)}{2} \|u - g^k\|_2^2 \} until
            \|f^k\|_{TV} \ge \|h^k\|_{TV} + \theta E(f^k) \|h^k - f^k\|_2^2 - E(f^k) \langle v^k, h^k - f^k \rangle
        h_0^k = h^k - \mathrm{med}(h^k)\mathbf{1}
        f^{k+1} = h_0^k / \|h_0^k\|_2
    end while



Local Convergence Results 

By leveraging the inequality (21), we can demonstrate that this approximate algorithm satisfies the same
local convergence properties as the idealized algorithm. Recalling the definition from [3], we say
that a set-valued algorithm $\mathcal{A}$ is closed at local minima (the CLM property) if $f^k \to f^\infty \in S_0^{n-1}$ and
$z^k \in \mathcal{A}(f^k)$ then $z^k \to f^\infty$ whenever $f^\infty$ is a local minimum of the energy. Note that the approximate
algorithm defined above is, in fact, a set-valued algorithm due to the lack of uniqueness in $v^k$, i.e. the
choice of subdifferential.

To demonstrate the CLM property for the approximate algorithm, suppose we have a sequence
$f^k \in S_0^{n-1}$ converging to some $f^\infty \in S_0^{n-1}$ and let $h^k$ denote the corresponding sequence of intermediate
steps. If $h^k \ne f^k$ only finitely many times then the CLM property is immediate. Indeed, then $h^k = f^k$ for
all $k$ sufficiently large, which implies $h^k \to f^\infty$, $h_0^k \to f^\infty$ and $z^k := h_0^k / \|h_0^k\|_2 \to f^\infty$ as well. Otherwise,
$h^k \ne f^k$ infinitely many times. Given any subsequence of $\{z^k\}$ we may restrict attention to a further
subsequence for which $h^{k_j} \ne f^{k_j}$ along the entire subsequence. As the $h^{k_j}$ satisfy the energy inequality,
they lie in a compact set. By passing to a further subsequence if necessary, we may therefore assume
that $h^{k_j} \to h^\infty$ and $v^{k_j} \to v^\infty \in \partial_0 \|f^\infty\|_1$ while still retaining $f^{k_j} \to f^\infty$ and the fact that $h^{k_j}$ satisfy
(21).

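We now suppose that $h^\infty \ne f^\infty$ and shall obtain a contradiction. Indeed, if $h^\infty \ne f^\infty$ then by passing
to the limit we find

    \|f^\infty\|_{TV} \ge \|h^\infty\|_{TV} + \theta E(f^\infty) \|h^\infty - f^\infty\|_2^2 - E(f^\infty) \langle v^\infty, h^\infty - f^\infty \rangle.    (22)

For $\eta \in (0, 1)$ let $h_\eta := \eta h^\infty + (1 - \eta) f^\infty$. By convexity of the TV semi-norm,

    \|h_\eta\|_{TV} \le \eta \|h^\infty\|_{TV} + (1 - \eta) \|f^\infty\|_{TV}  \implies  \frac{1}{\eta} \|h_\eta\|_{TV} - \frac{1 - \eta}{\eta} \|f^\infty\|_{TV} \le \|h^\infty\|_{TV}.

Substituting this estimate into (22) then shows

    \frac{1}{\eta} \|f^\infty\|_{TV} \ge \frac{1}{\eta} \|h_\eta\|_{TV} + \theta E(f^\infty) \|h^\infty - f^\infty\|_2^2 - \frac{1}{\eta} E(f^\infty) \langle v^\infty, h_\eta - f^\infty \rangle.

Once again, the fact that $v^\infty \in \partial_0 \|f^\infty\|_1$ implies

    \|h_\eta - \mathrm{med}(h_\eta)\mathbf{1}\|_1 \ge \|f^\infty\|_1 + \langle v^\infty, h_\eta - f^\infty \rangle.

Multiplying this inequality by $E(f^\infty)$ and adding it to $\eta$ times the previous inequality then shows

    E(f^\infty) \|h_\eta - \mathrm{med}(h_\eta)\mathbf{1}\|_1 \ge \|h_\eta\|_{TV} + \eta \theta E(f^\infty) \|h^\infty - f^\infty\|_2^2.

We may assume that $h_\eta$ is not constant, since otherwise this would imply $h^\infty = f^\infty$ as desired. We may
therefore divide by $\|h_\eta - \mathrm{med}(h_\eta)\mathbf{1}\|_1$ in the previous inequality to obtain

    E(f^\infty) \ge E(h_\eta) + \eta \theta E(f^\infty) \frac{\|h^\infty - f^\infty\|_2^2}{\|h_\eta - \mathrm{med}(h_\eta)\mathbf{1}\|_1}.

If $\|h^\infty - f^\infty\|_2 > 0$ this would imply $E(h_\eta) < E(f^\infty)$ for any $\eta$ that is strictly positive. As $h_\eta \to f^\infty$
as $\eta \to 0$ this would contradict the fact that $f^\infty$ is a local minimum of the energy, whence $h^\infty = f^\infty$
as desired. Thus any subsequence of $h^k$ has a further subsequence that converges to $f^\infty$, meaning the
whole sequence converges to this limit. This then implies that $h_0^k \to f^\infty$ and $z^k \to f^\infty$ as well, and this
establishes the CLM property for the approximate algorithm.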

To formulate a notion of local convergence, we need an analogue of a "strict" local minimum of 
the TV-Balanced Cut energy. Due to the invariance of this energy under scaling and the addition of 
constants, we cannot refer to a local minimum as "strict" in the usual sense. We must therefore remove 
the effects of these invariances when referring to a local minimum as strict. To this end, define the 
spherical and annular neighborhoods on $S_0^{n-1}$ by

    B_\epsilon(f^\infty) := \{ \|f - f^\infty\|_2 \le \epsilon \} \cap S_0^{n-1},    A_{\delta,\epsilon}(f^\infty) := \{ \delta \le \|f - f^\infty\|_2 \le \epsilon \} \cap S_0^{n-1}.

With these in hand we introduce the proper definition of a strict local minimum.

Definition 1 (Strict Local Minima). Let $f^\infty \in S_0^{n-1}$. We say $f^\infty$ is a strict local minimum of the
energy if there exists $\epsilon > 0$ so that $f \in B_\epsilon(f^\infty)$ and $f \ne f^\infty$ imply $E(f) > E(f^\infty)$.

The CLM property now allows us to quote a general result from [3] that establishes a local stability
property for the approximate algorithm:

Lemma 2 (Lyapunov Stability at Strict Local Minima). Fix $f^0 \in S_0^{n-1}$ and let $\{f^k\}$ denote any sequence
corresponding to the approximate algorithm. If $f^\infty$ is a strict local minimum of the energy, then for any
$\epsilon > 0$ there exists a $\gamma > 0$ so that if $f^0 \in B_\gamma(f^\infty)$ then $\{f^k\} \subset B_\epsilon(f^\infty)$.

Loosely speaking, this means that if we have a good initial guess for the solution of the TV-Balanced 
Cut problem then the approximate algorithm defined above will remain close to this initial guess while 
simultaneously lowering the TV-Balanced Cut energy. We emphasize that this property holds regardless 
of any assumptions made about the total variation solver $\Phi^m$ other than convergence, e.g. the semigroup
property. If we further assume the continuity and semigroup properties of the solver then this approximate 
algorithm satisfies the critical point property as well. In this case, the remaining theory of [3] applies
and we do, in fact, recover precisely all of the theoretical properties of the idealized algorithm with this 
approximate total variation algorithm. 
5 Numerical Experiments



All experiments that follow use a symmetric $k$-nearest neighbor graph combined with the weight similarity
function $w_{ij} = \exp(-r_{ij}^2 / \sigma^2)$. Here, $r_{ij} = \|x_i - x_j\|_2$ and the scale parameter $\sigma^2 = 3 d_k^2$, where $d_k$ denotes
the mean distance of the $k$-th nearest neighbor.
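
A minimal sketch of this graph construction is given below. It is not the authors' code: the function name `knn_graph`, the brute-force distance computation, and the symmetrization rule (an edge is kept when either endpoint lists the other among its $k$ nearest neighbors) are assumptions, since the text only states that the graph is symmetric.

```python
import math

def knn_graph(points, k):
    """Dense symmetric k-nearest-neighbor weight matrix with Gaussian weights
    w_ij = exp(-r_ij^2 / sigma^2), where sigma^2 = 3 * d_k^2."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbors of each point, excluding the point itself.
    nbrs = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
            for i in range(n)]
    # d_k: mean over all points of the distance to the k-th nearest neighbor.
    d_k = sum(dist[i][nbrs[i][-1]] for i in range(n)) / n
    sigma2 = 3.0 * d_k ** 2
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            w = math.exp(-dist[i][j] ** 2 / sigma2)
            W[i][j] = W[j][i] = w   # assumed rule: keep the edge if either end selects it
    return W

# Two small, well-separated clusters in the plane (hypothetical data).
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
W = knn_graph(points, 2)
```

With $k = 2$ each point in this example connects only to the two other points of its own cluster. A practical implementation on MNIST-sized data would use a sparse matrix and a spatial index instead of the quadratic distance table used here.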

We use the two-moon, MNIST and USPS datasets. The two-moon dataset [5] uses the same setting 
as in [16]. We take k = 5 nearest neighbors to construct the graph. We preprocessed the MNIST and 
USPS data by projecting onto the first 50 principal components, and take $k = 10$ nearest neighbors for
the MNIST and USPS datasets. 

We use Algorithm [3] and the method from [6] to solve the inner ROF problem ([5]). We terminate each
inner loop when either the condition

$$\|f^k\|_{TV} > \|h^k\|_{TV} + \theta E(f^k)\|h^k - f^k\|_2^2 - E(f^k)\langle v^k, h^k - f^k\rangle$$

is satisfied or 1,500 iterations are reached (which we take to mean that the solution has been found). We take $\theta = 0.99$
in all experiments.
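To make the stopping test concrete, the following sketch evaluates the inequality for given vectors. It is a hedged illustration rather than the authors' implementation: the function names are hypothetical, `f`, `h`, and `v` stand for $f^k$, $h^k$, and $v^k$ as defined earlier in the paper, `energy_f` is the value $E(f^k)$ supplied by the caller, and the strict inequality follows the condition as printed above. The `weights` argument maps index pairs (i, j) to $w_{ij}$; to match a sum over all ordered pairs as in equation (2), include both (i, j) and (j, i).

```python
def total_variation(weights, f):
    """Graph total variation: sum of w_ij * |f_i - f_j| over the given pairs."""
    return sum(w * abs(f[i] - f[j]) for (i, j), w in weights.items())


def should_stop_inner_loop(weights, f, h, v, energy_f, theta=0.99):
    """Return True when the adaptive stopping inequality holds.

    Checks  ||f||_TV > ||h||_TV + theta * E(f) * ||h - f||_2^2
                       - E(f) * <v, h - f>.
    """
    diff = [hi - fi for hi, fi in zip(h, f)]
    squared_norm = sum(d * d for d in diff)            # ||h - f||_2^2
    inner = sum(vi * di for vi, di in zip(v, diff))    # <v, h - f>
    rhs = (total_variation(weights, h)
           + theta * energy_f * squared_norm
           - energy_f * inner)
    return total_variation(weights, f) > rhs
```

In an outer loop one would call `should_stop_inner_loop` after each inner solver iteration and also stop once the 1,500-iteration cap is reached.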

The following table summarizes the results of these tests. It shows the mean error of classification 
(% of misclassified data) and the mean computational time for the proposed algorithm and the previous 
algorithm from [3] over 10 experiments. 





                      Adaptive Algorithm          Non-adaptive Algorithm from [3]
                      Error (%)    Time           Error (%)    Time
2 moons               9.06         2.03 sec.      8.69         2.06 sec.
MNIST (10 classes)    11.76        21.85 min.     11.78        45.01 min.
USPS (10 classes)     4.11         3.08 min.      4.11         5.15 min.



Reproducible research: The code is available at http://www.cs.cityu.edu.hk/~xbresson/codes.html

Acknowledgements: This work was supported by AFOSR MURI grant FA9550-10-1-0569, NSF grant
DMS-0902792, and Hong Kong GRF grant #110311.



References 

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[2] A. Bertozzi and A. Flenner. Diffuse Interface Models on Graphs for Classification of High
Dimensional Data. Multiscale Modeling and Simulation, 10(3):1090-1118, 2012.

[3] X. Bresson, T. Laurent, D. Uminsky, and J. von Brecht. Convergence and energy landscape for
Cheeger cut clustering. In Advances in Neural Information Processing Systems (NIPS), pages
1394-1402, 2012.

[4] X. Bresson, X.-C. Tai, T.F. Chan, and A. Szlam. Multi-Class Transductive Learning based on $\ell^1$
Relaxations of Cheeger Cut and Mumford-Shah-Potts Model. UCLA CAM Report, 2012.

[5] T. Bühler and M. Hein. Spectral Clustering Based on the Graph p-Laplacian. In International
Conference on Machine Learning, pages 81-88, 2009.

[6] A. Chambolle and T. Pock. A First-Order Primal-Dual Algorithm for Convex Problems with
Applications to Imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

[7] J. Cheeger. A Lower Bound for the Smallest Eigenvalue of the Laplacian. Problems in Analysis,
pages 195-199, 1970.

[8] F. R. K. Chung. Spectral Graph Theory, volume 92 of CBMS Regional Conference Series in
Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC, 1997.

[9] T. Goldstein and S. Osher. The Split Bregman Method for L1-Regularized Problems. SIAM Journal
on Imaging Sciences, 2(2):323-343, 2009.






[10] M. Hein and T. Bühler. An Inverse Power Method for Nonlinear Eigenproblems with Applications
in 1-Spectral Clustering and Sparse PCA. In Advances in Neural Information Processing Systems
(NIPS), pages 847-855, 2010.

[11] M. Hein and S. Setzer. Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts.
In Advances in Neural Information Processing Systems (NIPS), 2011.

[12] E. Merkurjev, T. Kostic, and A. Bertozzi. An MBO scheme on graphs for segmentation and image
processing. UCLA CAM Report 12-46, 2012.

[13] S. Rangapuram and M. Hein. Constrained 1-Spectral Clustering. In International Conference on
Artificial Intelligence and Statistics (AISTATS), pages 1143-1151, 2012.

[14] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear Total Variation Based Noise Removal Algorithms.
Physica D, 60(1-4):259-268, 1992.

[15] A. Szlam and X. Bresson. A total variation-based graph clustering algorithm for Cheeger ratio cuts.
UCLA CAM Report 09-68, 2009.

[16] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In Proceedings of the 27th International
Conference on Machine Learning, pages 1039-1046, 2010.

[17] Y. van Gennip and A. Bertozzi. Gamma-convergence of graph Ginzburg-Landau functionals. Advances
in Differential Equations, 17(11-12):1115-1180, 2012.






